GraphBuilder: A Scalable Graph ETL Framework

Authors

  • Nilesh Jain
  • Guangdeng Liao
  • Theodore L. Willke
Abstract

Graph abstraction is essential for many applications, from finding a shortest path to executing complex machine learning (ML) algorithms such as collaborative filtering. However, constructing graphs from relationships hidden within large unstructured datasets is challenging. Because graph construction is a data-parallel problem, MapReduce is well suited to the task. We developed GraphBuilder, an open-source, scalable framework for graph Extract-Transform-Load (ETL), to offload many of the complexities of building graphs, including construction, transformation, normalization, and partitioning. GraphBuilder is written in Java for ease of programming, and it scales using the MapReduce model. In this paper, we describe the motivation for GraphBuilder, its architecture, its MapReduce algorithms for graph processing, and a performance evaluation of the framework. Because large graphs must be partitioned across a cluster for storage and processing, and the partitioning method has a significant impact on performance, we develop several graph partitioning methods and evaluate their performance.
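To make the data-parallel framing in the abstract concrete, here is a minimal single-process sketch of graph construction as a map step, a reduce step, and a partitioning step. It is an illustration only and is not GraphBuilder's API: the function names (`map_record`, `reduce_edges`, `partition_edges`, `build_graph`), the co-occurrence rule for forming edges, and the hash-by-source partitioner are all assumptions made for the example, and it is written in Python rather than Java for brevity.

```python
from collections import defaultdict
from itertools import combinations
import zlib


def map_record(record):
    """Map step (hypothetical rule): emit ((src, dst), 1) for every pair of
    distinct tokens that co-occur in one input record."""
    tokens = sorted(set(record.split()))
    for src, dst in combinations(tokens, 2):
        yield (src, dst), 1


def reduce_edges(pairs):
    """Reduce step: sum the counts per edge key to get a weighted edge list."""
    weights = defaultdict(int)
    for edge, count in pairs:
        weights[edge] += count
    return dict(weights)


def partition_edges(edges, num_partitions):
    """Assign each edge to a partition using a stable hash of its source
    vertex, so all edges leaving one vertex land in the same partition."""
    parts = defaultdict(list)
    for (src, dst), weight in edges.items():
        pid = zlib.crc32(src.encode("utf-8")) % num_partitions
        parts[pid].append((src, dst, weight))
    return dict(parts)


def build_graph(records, num_partitions=2):
    """Run map, reduce, and partition in sequence over an iterable of records."""
    emitted = (pair for record in records for pair in map_record(record))
    edges = reduce_edges(emitted)
    return edges, partition_edges(edges, num_partitions)
```

Calling `build_graph(["a b c", "a b"])` returns the weighted edges together with their partition assignment. Hashing on the source vertex is only one simple choice; the abstract states that the authors develop and compare several partitioning methods, which this sketch does not attempt to reproduce.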

Similar resources

GraphBuilder – A Scalable Graph Construction Library for Apache™ Hadoop™

The exponential growth in the pursuit of knowledge gleaned from data relationships that are expressed naturally as large and complex graphs is fueling new parallel machine learning algorithms. The nature of these computations is iterative and data-dependent. Recently, frameworks have emerged to perform these computations in a distributed manner at commercial scale. But feeding data to these fra...

ETLMR: A Highly Scalable Dimensional ETL Framework Based on MapReduce

Extract-Transform-Load (ETL) flows periodically populate data warehouses (DWs) with data from different source systems. An increasing challenge for ETL flows is processing huge volumes of data quickly. MapReduce is establishing itself as the de-facto standard for large-scale data-intensive processing. However, MapReduce lacks support for high-level ETL specific constructs, resulting in low ETL ...

CloudETL: Scalable Dimensional ETL for Hadoop and Hive

Extract-Transform-Load (ETL) programs process data from sources into data warehouses (DWs). Due to the rapid growth of data volumes, there is an increasing demand for systems that can scale on demand. Recently, much attention has been given to MapReduce which is a framework for highly parallel handling of massive data sets in cloud environments. The MapReduce-based Hive has been proposed as a D...

Systematic ETL management - Experiences with high-level operators

Large organizations load much of their data into data warehouses for subsequent querying, analysis, and data mining. Extract-Transform-Load (ETL) workflows populate those data warehouses with data from various data sources by specifying and executing a set of transformations forming a directed acyclic transformation graph (DAG). Over time, hundreds of individual ETL workflows evolve as new sour...

Rule-Based Management of Schema Changes at ETL Sources

In this paper, we visit the problem of managing inconsistencies that emerge in ETL processes as a result of evolution operations occurring at their sources. We abstract Extract-Transform-Load (ETL) activities as queries and sequences of views. ETL activities and their sources are uniformly modeled as a graph that is annotated with rules for the management of evolution events. Given a change ...

Publication date: 2013